1. Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai 200438; 2. School of Computer Science, Fudan University, Shanghai 200438; 3. Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433
Abstract: Combined with deep models, deep reinforcement learning (RL) is widely applied in fields such as intelligent control and game competition. However, existing RL surveys mainly focus on individual core difficulties and neglect an overall analysis of the problem itself. Practical applications in real-world scenarios face many technical challenges, and approaches developed for a particular problem often perform worse than expected in specific scenarios. Therefore, the problem setting is defined in this paper from six major aspects: agent, task distribution, Markov decision process, policy class, learning objective and interaction mode. This problem-setting-driven perspective is used to analyze the overall research status as well as the elementary and extended RL settings. The development directions, key technologies and main motivations of current deep RL are then discussed. Moreover, expert interaction is taken as an example to further analyze the general development trends of the field from the problem-setting-driven perspective. Finally, hot topics and future directions for the field are proposed.
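For concreteness, two of these six aspects, the Markov decision process and the learning objective, admit a standard formalization. The following is a minimal sketch in conventional RL notation; the symbols are the usual ones and are not taken from the paper itself:

\[
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma), \qquad
P(s_{t+1} \mid s_t, a_t), \quad r(s_t, a_t), \quad \gamma \in [0, 1),
\]
\[
\pi^{*} = \arg\max_{\pi \in \Pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right],
\]

where $\Pi$ denotes the policy class and the expectation is taken over trajectories generated by the policy $\pi$ interacting with $\mathcal{M}$.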